Influence of Intrinsic Property Characteristics and External Environment on the Valuation in the Melbourne Housing Market



Introduction: The housing market in Melbourne is a complex and dynamic environment where various factors interplay to determine the value of properties. In this project, we delve into an exploratory data analysis (EDA) of the Melbourne housing market to uncover key insights into how property prices are influenced by both intrinsic characteristics, such as the number of rooms, and external factors, including location and proximity to the city center.

The goal of this analysis is to identify patterns and relationships within the dataset that drive property valuations. By the end of the project, we aim to develop a comprehensive understanding of the factors that most significantly affect house prices. This will culminate in a regression analysis, where we will quantify the impact of these variables and create a predictive model that can be used to estimate property values based on the identified features.

Importing Data and Dependencies

library(ggplot2)
library(dplyr)
library(GGally)
library(lubridate)
library(corrplot)
library(scales)
library(broom)
library(car)
#Importing Data
data <- read.csv("Data/MELBOURNE_HOUSE_PRICES_LESS.csv", stringsAsFactors = FALSE)
head(data)
##         Suburb          Address Rooms Type   Price Method  SellerG      Date
## 1   Abbotsford    49 Lithgow St     3    h 1490000      S   Jellis 1/04/2017
## 2   Abbotsford    59A Turner St     3    h 1220000      S Marshall 1/04/2017
## 3   Abbotsford    119B Yarra St     3    h 1420000      S   Nelson 1/04/2017
## 4   Aberfeldie       68 Vida St     3    h 1515000      S    Barry 1/04/2017
## 5 Airport West 92 Clydesdale Rd     2    h  670000      S   Nelson 1/04/2017
## 6 Airport West     4/32 Earl St     2    t  530000      S   Jellis 1/04/2017
##   Postcode            Regionname Propertycount Distance
## 1     3067 Northern Metropolitan          4019      3.0
## 2     3067 Northern Metropolitan          4019      3.0
## 3     3067 Northern Metropolitan          4019      3.0
## 4     3040  Western Metropolitan          1543      7.5
## 5     3042  Western Metropolitan          3464     10.4
## 6     3042  Western Metropolitan          3464     10.4
##                  CouncilArea
## 1         Yarra City Council
## 2         Yarra City Council
## 3         Yarra City Council
## 4 Moonee Valley City Council
## 5 Moonee Valley City Council
## 6 Moonee Valley City Council



1- Overview of the Dataset

glimpse(data)
## Rows: 63,023
## Columns: 13
## $ Suburb        <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Aberfeldie", …
## $ Address       <chr> "49 Lithgow St", "59A Turner St", "119B Yarra St", "68 V…
## $ Rooms         <int> 3, 3, 3, 3, 2, 2, 2, 3, 6, 3, 3, 4, 2, 4, 2, 4, 3, 2, 2,…
## $ Type          <chr> "h", "h", "h", "h", "h", "t", "u", "h", "h", "h", "u", "…
## $ Price         <int> 1490000, 1220000, 1420000, 1515000, 670000, 530000, 5400…
## $ Method        <chr> "S", "S", "S", "S", "S", "S", "S", "SP", "PI", "S", "S",…
## $ SellerG       <chr> "Jellis", "Marshall", "Nelson", "Barry", "Nelson", "Jell…
## $ Date          <chr> "1/04/2017", "1/04/2017", "1/04/2017", "1/04/2017", "1/0…
## $ Postcode      <int> 3067, 3067, 3067, 3040, 3042, 3042, 3042, 3042, 3021, 32…
## $ Regionname    <chr> "Northern Metropolitan", "Northern Metropolitan", "North…
## $ Propertycount <int> 4019, 4019, 4019, 1543, 3464, 3464, 3464, 3464, 1899, 32…
## $ Distance      <dbl> 3.0, 3.0, 3.0, 7.5, 10.4, 10.4, 10.4, 10.4, 14.0, 3.0, 1…
## $ CouncilArea   <chr> "Yarra City Council", "Yarra City Council", "Yarra City …
summary(data)
##     Suburb            Address              Rooms            Type          
##  Length:63023       Length:63023       Min.   : 1.000   Length:63023      
##  Class :character   Class :character   1st Qu.: 3.000   Class :character  
##  Mode  :character   Mode  :character   Median : 3.000   Mode  :character  
##                                        Mean   : 3.111                     
##                                        3rd Qu.: 4.000                     
##                                        Max.   :31.000                     
##                                                                           
##      Price             Method            SellerG              Date          
##  Min.   :   85000   Length:63023       Length:63023       Length:63023      
##  1st Qu.:  620000   Class :character   Class :character   Class :character  
##  Median :  830000   Mode  :character   Mode  :character   Mode  :character  
##  Mean   :  997898                                                           
##  3rd Qu.: 1220000                                                           
##  Max.   :11200000                                                           
##  NA's   :14590                                                              
##     Postcode     Regionname        Propertycount      Distance    
##  Min.   :3000   Length:63023       Min.   :   39   Min.   : 0.00  
##  1st Qu.:3056   Class :character   1st Qu.: 4380   1st Qu.: 7.00  
##  Median :3107   Mode  :character   Median : 6795   Median :11.40  
##  Mean   :3126                      Mean   : 7618   Mean   :12.68  
##  3rd Qu.:3163                      3rd Qu.:10412   3rd Qu.:16.70  
##  Max.   :3980                      Max.   :21650   Max.   :64.10  
##                                                                   
##  CouncilArea       
##  Length:63023      
##  Class :character  
##  Mode  :character  
##                    
##                    
##                    
## 

In this initial step, we can already observe some interesting insights. First, we now have a clear understanding of all the variables present in the dataset. Additionally, certain details stand out, such as the maximum number of rooms, which is unusually high. Although 31 rooms is not impossible, it is significantly far from the median, warranting further investigation.

We also noticed that the Price variable contains missing values (14,590 NAs). Given that Price is the central variable of this analysis, these missing values must be addressed to ensure the accuracy and reliability of our results. In addition, several variables are stored with unsuitable data types: the categorical variables and Date are read as character strings.

In the subsequent steps, we will delve deeper into these observations, starting with a thorough exploration of the outliers and a comprehensive strategy for handling missing data.
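
Before choosing a strategy, it helps to count the missing values in every column rather than only in Price. The following is a minimal sketch on a made-up toy data frame (the `toy` object and its values are assumptions for illustration, not the Melbourne data):

```r
# Toy data frame standing in for the housing data (values are made up)
toy <- data.frame(
  Rooms = c(3L, 2L, 4L, 3L),
  Price = c(1490000, NA, 530000, NA),
  Distance = c(3.0, 7.5, NA, 10.4)
)

# Count and percentage of missing values in each column
na_count <- colSums(is.na(toy))
na_pct <- round(100 * colMeans(is.na(toy)), 2)
data.frame(na_count, na_pct)
```

Applied to `data`, the same two calls would show at a glance whether Price is the only column with gaps.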



2- Data Pre-processing

2.1 - Data Types

data <- data %>%
  mutate(across(c(Suburb, Type, Method, Regionname, CouncilArea),as.factor),
         Date = as.Date(Date, format = "%d/%m/%Y"))
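
The date format matters here because the raw strings are day-first. A small sketch, using two dates that appear in the dataset, of what the `%d/%m/%Y` format does and how a month-first format would go wrong (the object names are illustrative):

```r
# The raw Date column is day/month/year: "1/04/2017" is 1 April 2017
raw_dates <- c("1/04/2017", "13/10/2018")

parsed <- as.Date(raw_dates, format = "%d/%m/%Y")
format(parsed, "%Y-%m-%d")

# A month-first format misreads the first value as 4 January
# and returns NA for the second, because there is no month 13
wrong <- as.Date(raw_dates, format = "%m/%d/%Y")
is.na(wrong)
```

Checking `sum(is.na(data$Date))` after the conversion is a cheap way to confirm that every date was parsed.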

2.2 - Missing Values

# Count and percentage of missing values in Price
na_sum <- sum(is.na(data$Price))
freq_na <- round(na_sum / nrow(data) * 100, digits = 2)

# Plotting the NA percentage
na_data <- data.frame(
  category = c("NA", "Not NA"),
  count = c(na_sum, nrow(data) - na_sum)
)

ggplot(na_data, aes(x = "", y = count, fill = category)) +
  geom_bar(width = 1, stat = "identity") +
  coord_polar("y", start = 0) +
  labs(title = "Frequency of NA Values in the Price Column") +
  theme_void() + 
  theme(legend.title = element_blank()) +  
  scale_fill_manual(values = c("NA" = "lightblue", "Not NA" = "skyblue")) +
  geom_text(aes(label = paste0(round(count / sum(count) * 100, 2), "%")),
            position = position_stack(vjust = 0.5))

# Remove rows that contain any missing value
data_cleaned <- data %>%
  filter(rowSums(is.na(.)) == 0)

glimpse(data_cleaned)
## Rows: 48,433
## Columns: 13
## $ Suburb        <fct> Abbotsford, Abbotsford, Abbotsford, Aberfeldie, Airport …
## $ Address       <chr> "49 Lithgow St", "59A Turner St", "119B Yarra St", "68 V…
## $ Rooms         <int> 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 4, 2, 4, 2, 3, 2, 2, 3, 2,…
## $ Type          <fct> h, h, h, h, h, t, u, h, h, u, h, h, h, h, h, u, h, h, u,…
## $ Price         <int> 1490000, 1220000, 1420000, 1515000, 670000, 530000, 5400…
## $ Method        <fct> S, S, S, S, S, S, S, SP, S, S, S, S, S, SP, S, S, S, S, …
## $ SellerG       <chr> "Jellis", "Marshall", "Nelson", "Barry", "Nelson", "Jell…
## $ Date          <date> 2017-04-01, 2017-04-01, 2017-04-01, 2017-04-01, 2017-04…
## $ Postcode      <int> 3067, 3067, 3067, 3040, 3042, 3042, 3042, 3042, 3206, 30…
## $ Regionname    <fct> Northern Metropolitan, Northern Metropolitan, Northern M…
## $ Propertycount <int> 4019, 4019, 4019, 1543, 3464, 3464, 3464, 3464, 3280, 21…
## $ Distance      <dbl> 3.0, 3.0, 3.0, 7.5, 10.4, 10.4, 10.4, 10.4, 3.0, 10.5, 1…
## $ CouncilArea   <fct> Yarra City Council, Yarra City Council, Yarra City Counc…
summary(data_cleaned)
##             Suburb        Address              Rooms        Type     
##  Reservoir     : 1067   Length:48433       Min.   : 1.000   h:34161  
##  Bentleigh East:  696   Class :character   1st Qu.: 2.000   t: 4980  
##  Richmond      :  642   Mode  :character   Median : 3.000   u: 9292  
##  Craigieburn   :  598                      Mean   : 3.072            
##  Preston       :  593                      3rd Qu.: 4.000            
##  Mount Waverley:  556                      Max.   :31.000            
##  (Other)       :44281                                                
##      Price              Method        SellerG               Date           
##  Min.   :   85000   S      :30624   Length:48433       Min.   :2016-01-28  
##  1st Qu.:  620000   SP     : 6480   Class :character   1st Qu.:2016-12-03  
##  Median :  830000   PI     : 5940   Mode  :character   Median :2017-08-26  
##  Mean   :  997898   VB     : 5024                      Mean   :2017-07-31  
##  3rd Qu.: 1220000   SA     :  365                      3rd Qu.:2018-03-03  
##  Max.   :11200000   PN     :    0                      Max.   :2018-10-13  
##                     (Other):    0                                          
##     Postcode                         Regionname    Propertycount  
##  Min.   :3000   Northern Metropolitan     :13598   Min.   :   39  
##  1st Qu.:3051   Southern Metropolitan     :12549   1st Qu.: 4280  
##  Median :3103   Western Metropolitan      : 9680   Median : 6567  
##  Mean   :3123   Eastern Metropolitan      : 7585   Mean   : 7566  
##  3rd Qu.:3163   South-Eastern Metropolitan: 4010   3rd Qu.:10412  
##  Max.   :3980   Northern Victoria         :  455   Max.   :21650  
##                 (Other)                   :  556                  
##     Distance                     CouncilArea   
##  Min.   : 0.0   Darebin City Council   : 3462  
##  1st Qu.: 7.0   Boroondara City Council: 3455  
##  Median :11.7   Banyule City Council   : 2902  
##  Mean   :12.7   Brimbank City Council  : 2720  
##  3rd Qu.:16.7   Moreland City Council  : 2519  
##  Max.   :55.8   Bayside City Council   : 2495  
##                 (Other)                :30880

2.3- Duplicate Values

dup <- data_cleaned %>%
  filter(duplicated(.))
dup
##          Suburb        Address Rooms Type   Price Method  SellerG       Date
## 1 Fitzroy North 5/16 Taplin St     2    h 1010000     SP Woodards 2018-05-05
##   Postcode            Regionname Propertycount Distance           CouncilArea
## 1     3068 Northern Metropolitan          6244      3.6 Moreland City Council
data_cleaned <- data_cleaned %>%
  distinct()
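
As a reminder of how duplicate detection behaves, here is a small base R sketch on made-up rows (the addresses and prices are hypothetical). `duplicated()` flags only the second and later copies of a row, so removing the flagged rows keeps one copy of each record:

```r
toy <- data.frame(
  Address = c("1 Example St", "2 Example St", "1 Example St"),
  Price = c(500000, 750000, 500000)
)

# Only the later copy of a repeated row is marked TRUE
duplicated(toy)

# unique() on a data frame keeps the first copy of each row
deduped <- unique(toy)
nrow(deduped)
```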

Upon reviewing the dataset, we learned that 23.15% of the values in the Price column are missing. While imputing missing values is a common technique to preserve data and maintain sample size, the decision to do so must be carefully weighed against the potential risks it introduces, particularly when dealing with a large proportion of missing data.

Imputing such a large share of Price could cause the model to “learn” from artificially introduced values rather than from the true underlying patterns. This can produce a model that performs well on training data but fails to generalize to new, unseen data. Imputed values, especially those based on simple methods such as the mean or median, can introduce patterns that do not exist in the real data and give a false sense of accuracy. In addition, even after removing the entries with missing values, we still have 48,433 records, which is more than sufficient for a robust analysis. Regarding duplicates, we identified and removed one duplicated entry, leaving 48,432 records.
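
One concrete risk of simple imputation can be shown on simulated data. The sketch below is an illustration only: the prices are drawn from a log-normal distribution with assumed parameters, and about 23% of them are removed at random. Filling the gaps with the mean leaves the mean unchanged but shrinks the spread, because the filled-in values add no variation:

```r
set.seed(42)  # arbitrary seed so the sketch is reproducible

# Simulated right-skewed "prices"; not the Melbourne data
price <- rlnorm(1000, meanlog = 13.6, sdlog = 0.5)

# Knock out roughly 23% of the values at random
missing <- runif(1000) < 0.23
observed <- price[!missing]

# Mean imputation: every gap receives the same value
imputed <- price
imputed[missing] <- mean(observed)

sd(observed)  # spread of the values we actually have
sd(imputed)   # smaller, since the imputed values sit exactly on the mean
```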



2.4- Dealing with outliers

room_dist <- ggplot(data_cleaned, aes(x = as.factor(Rooms))) +
  geom_bar(fill = "skyblue", color = "white") +
  labs(x = "Rooms", y = "Count", title = "Frequency of Number of Rooms") +
  theme_classic()
room_dist

rooms_freq <- as.data.frame(table(data_cleaned$Rooms))
rooms_freq
##    Var1  Freq
## 1     1  1670
## 2     2 10673
## 3     3 21812
## 4     4 11576
## 5     5  2350
## 6     6   283
## 7     7    36
## 8     8    19
## 9     9     2
## 10   10     6
## 11   11     1
## 12   12     2
## 13   16     1
## 14   31     1
data_cleaned<- data_cleaned %>% filter(Rooms <= 8 )
hist(data_cleaned$Price, main = "Frequency of Price", xlab = "Price", col = "skyblue", border = "white")

hist(data_cleaned$Propertycount, main = "Frequency of Property Count", xlab = "Property Count", col = "skyblue", 
     border = "white")

hist(data_cleaned$Distance, main = "Frequency of Distance", xlab = "Distance", col = "skyblue", border = "white")

ggplot(data_cleaned, aes(x=Distance))+
  geom_boxplot()

data_cleaned$Log_Price <- log(data_cleaned$Price)
hist(data_cleaned$Price, main = "Frequency of Original Price", xlab = "Price", col = "skyblue", border = "white")

hist(data_cleaned$Log_Price, main="Frequency of Log-Transformed Price", xlab="Log(Price)", col = "skyblue", border = "white")

In this analysis, our goal is to conduct a comprehensive market study, which means we will aim to retain most of the outliers. Many of the continuous variables in our dataset contain some proportion of outliers.

Let’s start by analyzing the Price variable. This variable exhibits a significant right skew. The outliers here do not appear to be data-entry errors. If our focus were on a more conservative market analysis, we might have opted to remove these values. However, this is not our objective in this analysis. Instead, we have chosen to retain these outliers and apply a log transformation to Price, which reduces the skew and brings its distribution closer to normal.

Rooms was the only variable from which we removed values. Specifically, we excluded properties with more than 8 rooms. Each of these higher categories contains very few observations (13 in total), which could unduly influence the results.

For the rest of the variables, we have decided to keep the outliers intact, consistent with our goal of conducting a comprehensive market analysis. When we proceed to the regression analysis, we will evaluate the impact of this decision on our model.
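
The effect of the log transformation on skew can be quantified. The following sketch uses simulated log-normal prices (the parameters are assumptions, not estimates from the dataset) and a hand-written `skewness()` helper, since base R has no built-in skewness function:

```r
set.seed(1)  # arbitrary seed for reproducibility

# Simulated right-skewed prices; not the Melbourne data
price <- rlnorm(5000, meanlog = 13.6, sdlog = 0.5)

# Sample skewness: mean of the cubed standardized values
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

skew_raw <- skewness(price)
skew_log <- skewness(log(price))
c(raw = skew_raw, log = skew_log)
```

A value near zero indicates a roughly symmetric distribution, so comparing the two numbers on `data_cleaned$Price` and `data_cleaned$Log_Price` would complement the histograms.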

3- Exploratory Data Analysis (EDA)

Now that we have cleaned our data, we can start analyzing our variables to uncover meaningful insights. Our first step will involve exploring the distributions and relationships between key variables to understand their impact on property prices. We will divide this section into three parts:

3.1 Univariate Analysis: We will begin by examining the distribution of individual variables. This includes looking at the range, central tendency (mean, median), and variability (standard deviation) for numerical variables such as Rooms, Distance, and Propertycount. For categorical variables like Suburb, Type, Method, and CouncilArea, we will analyze the frequency distribution to identify the most common categories.

3.2 Bivariate Analysis: Next, we will explore the relationships between pairs of variables. This will include scatter plots and correlation matrices to identify potential linear relationships between Price and other numerical variables, as well as box plots to examine how Price varies across different categories, such as Type or Regionname.

3.3 Multivariate Analysis: Finally, we will perform a multivariate analysis to understand the combined effect of multiple variables on Price. Techniques such as regression analysis will be used to quantify the impact of each variable while controlling for others.
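
To make the regression step concrete, here is a minimal sketch on simulated data. The coefficients used to generate the data (0.20 per room, -0.02 per kilometre) are assumptions chosen for illustration and say nothing about the Melbourne market. Because the response is on the log scale, a coefficient b corresponds to a change of roughly 100 * (exp(b) - 1) percent in price:

```r
set.seed(123)  # arbitrary seed for reproducibility
n <- 2000

# Simulated predictors
rooms <- sample(1:6, n, replace = TRUE)
distance <- runif(n, 0, 40)

# Assumed "true" effects, chosen only for illustration
log_price <- 13 + 0.20 * rooms - 0.02 * distance + rnorm(n, sd = 0.3)

# Linear model on the log scale
fit <- lm(log_price ~ rooms + distance)
coefs <- coef(fit)

# Approximate percentage change in price for one extra room
pct_per_room <- 100 * (exp(coefs[["rooms"]]) - 1)
coefs
pct_per_room
```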



3.1 Univariate Analysis

Numerical Variables



ggplot(data_cleaned, aes(Log_Price)) +
  geom_histogram(fill = "skyblue", color = "white")+
  labs(x = "Log(Price)", y = "Frequency", title = "Frequency of Log_Price" ) +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data_cleaned, aes(Rooms)) +
  geom_bar(fill = "skyblue", color = "white")+
  labs(x = "Rooms", y = "Frequency", title = "Frequency of Rooms" ) +
  theme_minimal()

ggplot(data_cleaned, aes(Propertycount)) +
  geom_histogram(fill = "skyblue", color = "white")+
  labs(x = "Property Count", y = "Frequency", title = "Frequency of Propertycount" ) +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

ggplot(data_cleaned, aes(Distance)) +
  geom_histogram(fill = "skyblue", color = "white")+
  labs(x = "Distance", y = "Frequency", title = "Frequency of Distance" ) +
  theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
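
The `stat_bin()` message above is a prompt to choose the bin width deliberately. One common option is the Freedman-Diaconis rule, binwidth = 2 * IQR / n^(1/3). The sketch below computes it on simulated values (the data and object names are illustrative assumptions):

```r
set.seed(2)  # arbitrary seed for reproducibility

# Simulated log prices; not the Melbourne data
log_price <- rnorm(5000, mean = 13.7, sd = 0.5)

# Freedman-Diaconis rule: binwidth = 2 * IQR / n^(1/3)
fd_binwidth <- 2 * IQR(log_price) / length(log_price)^(1/3)

# Number of bins implied by that width
fd_bins <- ceiling(diff(range(log_price)) / fd_binwidth)

fd_binwidth
fd_bins
# In ggplot2 the width can be passed as geom_histogram(binwidth = fd_binwidth)
```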

Categorical Variables

summary(data_cleaned)
##             Suburb        Address              Rooms       Type     
##  Reservoir     : 1067   Length:48419       Min.   :1.000   h:34147  
##  Bentleigh East:  696   Class :character   1st Qu.:2.000   t: 4980  
##  Richmond      :  642   Mode  :character   Median :3.000   u: 9292  
##  Craigieburn   :  598                      Mean   :3.069            
##  Preston       :  593                      3rd Qu.:4.000            
##  Mount Waverley:  556                      Max.   :8.000            
##  (Other)       :44267                                               
##      Price              Method        SellerG               Date           
##  Min.   :   85000   S      :30616   Length:48419       Min.   :2016-01-28  
##  1st Qu.:  620000   SP     : 6477   Class :character   1st Qu.:2016-12-03  
##  Median :  830000   PI     : 5938   Mode  :character   Median :2017-08-26  
##  Mean   :  997456   VB     : 5023                      Mean   :2017-07-31  
##  3rd Qu.: 1220000   SA     :  365                      3rd Qu.:2018-03-03  
##  Max.   :11200000   PN     :    0                      Max.   :2018-10-13  
##                     (Other):    0                                          
##     Postcode                         Regionname    Propertycount  
##  Min.   :3000   Northern Metropolitan     :13593   Min.   :   39  
##  1st Qu.:3051   Southern Metropolitan     :12545   1st Qu.: 4280  
##  Median :3103   Western Metropolitan      : 9680   Median : 6567  
##  Mean   :3123   Eastern Metropolitan      : 7583   Mean   : 7567  
##  3rd Qu.:3163   South-Eastern Metropolitan: 4008   3rd Qu.:10412  
##  Max.   :3980   Northern Victoria         :  455   Max.   :21650  
##                 (Other)                   :  555                  
##     Distance                     CouncilArea      Log_Price    
##  Min.   : 0.0   Darebin City Council   : 3462   Min.   :11.35  
##  1st Qu.: 7.0   Boroondara City Council: 3454   1st Qu.:13.34  
##  Median :11.7   Banyule City Council   : 2900   Median :13.63  
##  Mean   :12.7   Brimbank City Council  : 2720   Mean   :13.68  
##  3rd Qu.:16.7   Moreland City Council  : 2517   3rd Qu.:14.01  
##  Max.   :55.8   Bayside City Council   : 2495   Max.   :16.23  
##                 (Other)                :30871
n_suburb <- length(levels(data_cleaned$Suburb))
paste("There are", n_suburb, "levels in Suburb")
## [1] "There are 380 levels in Suburb"

There are too many levels in Suburb to inspect individually. Let's look at the 20 suburbs with the most observations.

suburb_frequencies<- data_cleaned %>%
  group_by(Suburb) %>%
  summarise(count = n()) %>%
  arrange(desc(count)) 

top20_suburb <- suburb_frequencies %>%
  slice_head(n = 20)


ggplot(suburb_frequencies, aes(x = count)) +
  geom_histogram(bins = 30, fill = "skyblue", color = "white") +
  labs(title = "Histogram of Suburb Frequencies", x = "Number of Observations", y = "Frequency") +
  theme_minimal()

top20_suburb
## # A tibble: 20 × 2
##    Suburb         count
##    <fct>          <int>
##  1 Reservoir       1067
##  2 Bentleigh East   696
##  3 Richmond         642
##  4 Craigieburn      598
##  5 Preston          593
##  6 Mount Waverley   556
##  7 Brunswick        540
##  8 Northcote        496
##  9 Cheltenham       493
## 10 Glen Waverley    486
## 11 Essendon         485
## 12 Glenroy          482
## 13 Coburg           464
## 14 Mill Park        454
## 15 South Yarra      436
## 16 Glen Iris        434
## 17 Pascoe Vale      433
## 18 Kew              430
## 19 Bundoora         429
## 20 Hawthorn         428
representation_top20 <- round(((sum(top20_suburb$count)/nrow(data_cleaned)) * 100), digits = 2)
paste("Top 20 suburbs represent", representation_top20, "% of all observations")
## [1] "Top 20 suburbs represent 21.98 % of all observations"

The majority of suburbs have fewer than 200 observations, indicating a wide distribution of data across different areas. The top 20 suburbs with the most observations account for only 21.98% of the total data, highlighting the fragmented nature of the dataset.

This suggests that while these top 20 suburbs provide a significant amount of data, much of the market analysis will also need to consider the long tail of less represented suburbs. It could be interesting to study the price fluctuations between these 20 most common suburbs and the less represented ones, to see if there are significant differences in price trends that could inform broader market insights.
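
One way to set up such a comparison is to keep the most frequent suburbs as their own levels and pool the rest into a single "Other" category. The following is a base R sketch on a made-up vector; the cutoff of 2 levels and the object names are assumptions for illustration (the analysis above would use 20):

```r
# Made-up suburb labels standing in for data_cleaned$Suburb
suburb <- c("Reservoir", "Reservoir", "Reservoir", "Richmond", "Richmond",
            "Kew", "Preston")

# Names of the 2 most frequent suburbs
top_levels <- names(sort(table(suburb), decreasing = TRUE))[1:2]

# Keep the frequent suburbs, pool everything else into "Other"
suburb_lumped <- ifelse(suburb %in% top_levels, suburb, "Other")
table(suburb_lumped)
```

The lumped variable can then be used as a grouping factor when comparing prices between well represented and sparsely represented suburbs.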

region_frequencies <- as.data.frame(table(data_cleaned$Regionname))

region_frequencies
##                         Var1  Freq
## 1       Eastern Metropolitan  7583
## 2           Eastern Victoria   374
## 3      Northern Metropolitan 13593
## 4          Northern Victoria   455
## 5 South-Eastern Metropolitan  4008
## 6      Southern Metropolitan 12545
## 7       Western Metropolitan  9680
## 8           Western Victoria   181
ggplot(region_frequencies, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  labs(title = "Frequency of Regionname", x = "Region", y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

method_frequencies <- as.data.frame(table(data_cleaned$Method))

method_frequencies
##   Var1  Freq
## 1   PI  5938
## 2   PN     0
## 3    S 30616
## 4   SA   365
## 5   SN     0
## 6   SP  6477
## 7   SS     0
## 8   VB  5023
## 9    W     0
ggplot(method_frequencies, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  labs(title = "Frequency of Method", x = "Method", y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

type_frequencies <- as.data.frame(table(data_cleaned$Type))

type_frequencies
##   Var1  Freq
## 1    h 34147
## 2    t  4980
## 3    u  9292
ggplot(type_frequencies, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  labs(title = "Frequency of Type", x = "Type", y = "Frequency") +
    scale_x_discrete(labels = c("h" = "House", "u" = "Unit", "t" = "Townhouse")) +
  theme_minimal() 

councilarea_frequencies <- as.data.frame(table(data_cleaned$CouncilArea))

councilarea_frequencies
##                              Var1 Freq
## 1            Banyule City Council 2900
## 2            Bayside City Council 2495
## 3         Boroondara City Council 3454
## 4           Brimbank City Council 2720
## 5          Cardinia Shire Council   52
## 6              Casey City Council  343
## 7            Darebin City Council 3462
## 8          Frankston City Council  656
## 9          Glen Eira City Council 2351
## 10 Greater Dandenong City Council  596
## 11       Hobsons Bay City Council 1112
## 12              Hume City Council 2373
## 13          Kingston City Council 2024
## 14              Knox City Council  749
## 15   Macedon Ranges Shire Council  114
## 16        Manningham City Council 1730
## 17       Maribyrnong City Council 1734
## 18         Maroondah City Council 1006
## 19         Melbourne City Council 2054
## 20            Melton City Council  551
## 21         Mitchell Shire Council   29
## 22            Monash City Council 2439
## 23     Moonee Valley City Council 2163
## 24        Moorabool Shire Council   11
## 25          Moreland City Council 2517
## 26      Murrindindi Shire Council    1
## 27        Nillumbik Shire Council  238
## 28      Port Phillip City Council 1291
## 29       Stonnington City Council 1227
## 30        Whitehorse City Council 1319
## 31        Whittlesea City Council 2059
## 32           Wyndham City Council 1154
## 33             Yarra City Council 1320
## 34     Yarra Ranges Shire Council  175
ggplot(councilarea_frequencies, aes(x = reorder(Var1, Freq), y = Freq)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  labs(title = "Frequency of Council Area", x = "Council Area", y = "Frequency") +
  theme_minimal() +
  theme(axis.text.x = element_text(angle =90, hjust = 1))

Regarding the method of sale, there are 9 levels:

-S (Sold): The property was successfully sold at the auction.

-SP (Sold Prior): The property was sold before the auction took place.

-PI (Passed In): The property did not meet the reserve price at auction and was not sold. It may still be on the market or available for negotiation.

-PN (Sold Prior Not Disclosed): The property was sold before the auction, but the sale price was not disclosed.

-SN (Sold Not Disclosed): The property was sold at the auction, but the sale price was not disclosed.

-VB (Vendor Bid): A bid was placed by the vendor (seller) during the auction, often to help stimulate bidding, but this does not constitute a sale.

-W (Withdrawn Prior to Auction): The property was withdrawn from the auction before it took place, possibly because it was sold prior or the seller decided not to proceed.

-SA (Sold After Auction): The property did not sell during the auction but was sold afterward, either through negotiations or another method.

-SS (Sold After Auction Price Not Disclosed): The property was sold after the auction, but the sale price was not disclosed.

Out of these 9 levels, four did not occur in the cleaned dataset. This could be due to the following reasons:

PN (Sold Prior Not Disclosed), SN (Sold Not Disclosed), and SS (Sold After Auction Price Not Disclosed): These methods involve transactions where the sale price was not disclosed. Entries with a missing Price were removed during pre-processing, which explains their absence in the cleaned dataset.

W (Withdrawn Prior to Auction): This method indicates that the property was withdrawn before the auction, likely resulting in no associated sale price. Since our analysis is focused on transactions with available price data, these entries would not be relevant and were consequently excluded.
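
These four methods still appear in the frequency table with a count of 0 because a factor keeps its levels even when no rows use them. A small sketch on a made-up factor shows how `droplevels()` removes the unused levels (the values are illustrative):

```r
# Made-up factor with two levels that no observation uses
method <- factor(c("S", "SP", "S", "PI"),
                 levels = c("PI", "PN", "S", "SN", "SP"))
table(method)  # PN and SN appear with a count of 0

# Drop the levels that no longer occur
method <- droplevels(method)
levels(method)
```

Dropping unused levels before plotting or modeling avoids empty bars and empty categories in the output.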

Regarding the type variable, the distribution of property types in the dataset reflects a market where houses are the predominant type of property being sold. This could influence various aspects of the market analysis, such as price trends and buyer demographics, and should be taken into account when interpreting the results of any further analysis.





3.2 Bivariate Analysis

cor_matrix <- cor(data_cleaned %>% select(where(is.numeric)))
cor_matrix
##                     Rooms        Price      Postcode Propertycount     Distance
## Rooms          1.00000000  0.414067323  0.0933474308 -0.0578419190  0.283653854
## Price          0.41406732  1.000000000  0.0031562658 -0.0607194850 -0.253832105
## Postcode       0.09334743  0.003156266  1.0000000000 -0.0009054878  0.504401504
## Propertycount -0.05784192 -0.060719485 -0.0009054878  1.0000000000  0.007647533
## Distance       0.28365385 -0.253832105  0.5044015045  0.0076475335  1.000000000
## Log_Price      0.46680504  0.927242791 -0.0139493522 -0.0851143692 -0.260759479
##                 Log_Price
## Rooms          0.46680504
## Price          0.92724279
## Postcode      -0.01394935
## Propertycount -0.08511437
## Distance      -0.26075948
## Log_Price      1.00000000
corrplot(cor_matrix)

Price by Rooms

price_by_room <- ggplot(data_cleaned, aes(x = as.factor(Rooms), y= Log_Price)) +
  geom_violin(trim = FALSE, fill = "lightblue", color = "black")+
    geom_boxplot(width = 0.1, fill = "white", outlier.color = "red") +
  labs (x = "Number of Rooms", title = "Houses Log(Prices) by Rooms")
price_by_room +
  theme_minimal()

The relationship between the number of rooms and log-transformed price suggests that, as expected, properties with more rooms tend to have higher prices (correlation of 0.47 between Rooms and Log_Price). However, the variation in prices within each room category is substantial, indicating that the number of rooms is not the sole determinant of property prices. The presence of outliers and the wide range of values observed reinforce the need to consider other variables when modeling price.

The correlation between price and distance from the city center is negative (-0.25), which suggests that properties located farther from the city center tend to have lower prices. There is also a positive correlation between the number of rooms and distance from the city center (0.28). This might indicate that larger properties with more rooms are more commonly found in suburban or rural areas than in the city center, where space is limited and properties tend to be smaller.
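
The correlation matrix above reports different values for Price and Log_Price because Pearson correlation measures linear association and is therefore sensitive to the transformation. A rank-based (Spearman) correlation is unchanged by any monotonic transformation such as the logarithm. The sketch below illustrates this on made-up values (the numbers are hypothetical, not taken from the dataset):

```r
# Made-up rooms and prices for illustration
rooms <- c(1, 2, 2, 3, 3, 4, 5)
price <- c(400000, 520000, 610000, 830000, 790000, 1500000, 3200000)

# Pearson correlation changes when price is log-transformed
pearson_raw <- cor(rooms, price)
pearson_log <- cor(rooms, log(price))

# Spearman correlation depends only on ranks, so it does not change
spearman_raw <- cor(rooms, price, method = "spearman")
spearman_log <- cor(rooms, log(price), method = "spearman")

c(pearson_raw, pearson_log, spearman_raw, spearman_log)
```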

price_by_type <- ggplot(data_cleaned, aes(Type, Log_Price)) +
  geom_violin(trim = FALSE, fill = "lightblue", color = "black") +
  geom_boxplot(width = 0.1, fill = "white", outlier.color = "red") +
  labs(title = "Violin Plot of Log(Price) by Type", x = "Type of Property", y = "Log(Price)") +
  scale_x_discrete(labels = c("h" = "House", "u" = "Unit", "t" = "Townhouse")) +
  theme_minimal()
price_by_type

The price distribution among different property types (houses, units, and townhouses) shows clear differences. Houses tend to have higher prices compared to units, while townhouses are in an intermediate position.
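
When summarizing log prices by group, note that back-transforming a mean of logs with `exp()` gives the geometric mean, which is never larger than the arithmetic mean and is less affected by a few very expensive properties. A sketch on made-up values (types and prices are hypothetical):

```r
# Made-up prices by property type
toy <- data.frame(
  Type = c("h", "h", "h", "u", "u", "t"),
  Price = c(800000, 1200000, 3000000, 450000, 600000, 900000)
)

# Mean of log prices per type, back-transformed to dollars
mean_log <- tapply(log(toy$Price), toy$Type, mean)
geo_mean <- exp(mean_log)  # geometric mean

# Ordinary arithmetic mean per type, for comparison
arith_mean <- tapply(toy$Price, toy$Type, mean)

round(cbind(geo_mean, arith_mean))
```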

#price_by_method
price_by_method <- ggplot(data_cleaned, aes(Method, Log_Price)) +
  geom_violin(trim = FALSE, fill = "lightblue", color = "black") +
  geom_boxplot(width = 0.1, fill = "white", outlier.color = "red") +
  labs(title = "Violin Plot of Log(Price) by Method", x = "Sell Method", y = "Log(Price)") +
  scale_x_discrete(labels = c("PI" = "Passed In", "S" = "Sold", "SA" = "Sold After Auction", "SP" = "Sold Prior", "VB" = "Vendor Bid")) +
  theme_minimal()
price_by_method

#price_by_Region
price_by_region <- ggplot(data_cleaned, aes(Regionname, Log_Price)) +
  geom_violin(trim = FALSE, fill = "lightblue", color = "black") +
  geom_boxplot(width = 0.1, fill = "white", outlier.color = "red") +
  labs(title = "Violin Plot of Log(Price) by Region", x = "Region Name", y = "Log(Price)") +
  theme_minimal()+
    theme(axis.text.x = element_text(angle =45, hjust = 1))
price_by_region 

#Seller- Top sellers
sales_by_seller <- data_cleaned %>%
  group_by(SellerG) %>%
  summarise(num_sales = n(),
    # as.numeric() guards against integer overflow on large totals
    total_sales = sum(as.numeric(Price))
  ) %>%
  arrange(desc(num_sales)) 
summary(sales_by_seller)
##    SellerG            num_sales        total_sales       
##  Length:422         Min.   :   1.00   Min.   :3.250e+05  
##  Class :character   1st Qu.:   1.00   1st Qu.:1.308e+06  
##  Mode  :character   Median :   5.00   Median :4.446e+06  
##                     Mean   : 114.74   Mean   :1.144e+08  
##                     3rd Qu.:  30.75   3rd Qu.:2.490e+07  
##                     Max.   :4818.00   Max.   :5.298e+09
top_sellers <- slice_head(sales_by_seller, n=20)
top_sellers
## # A tibble: 20 × 3
##    SellerG       num_sales total_sales
##    <chr>             <int>       <dbl>
##  1 Barry              4818  4022899061
##  2 Jellis             4087  5298022007
##  3 Nelson             4007  4086297166
##  4 Ray                3650  2950388659
##  5 hockingstuart      3465  3174822461
##  6 Buxton             2578  3095943274
##  7 Marshall           1720  3331132138
##  8 Fletchers          1160  1384291988
##  9 Biggin             1022  1002291929
## 10 Brad                911   743992700
## 11 Harcourts           911   783065159
## 12 YPA                 897   513150700
## 13 Woodards            872   914910059
## 14 McGrath             860   851140627
## 15 Noel                835   961840513
## 16 Hodges              713   820041438
## 17 Stockdale           693   488813242
## 18 Greg                633   717565749
## 19 HAR                 578   382470773
## 20 Jas                 573   502753790
# Top 20 sellers ranked by total sales value
top_money <- sales_by_seller %>%
  arrange(desc(total_sales)) %>%
  slice_head(n = 20)

top_money
## # A tibble: 20 × 3
##    SellerG       num_sales total_sales
##    <chr>             <int>       <dbl>
##  1 Jellis             4087  5298022007
##  2 Nelson             4007  4086297166
##  3 Barry              4818  4022899061
##  4 Marshall           1720  3331132138
##  5 hockingstuart      3465  3174822461
##  6 Buxton             2578  3095943274
##  7 Ray                3650  2950388659
##  8 Fletchers          1160  1384291988
##  9 Biggin             1022  1002291929
## 10 Noel                835   961840513
## 11 Woodards            872   914910059
## 12 McGrath             860   851140627
## 13 Hodges              713   820041438
## 14 Harcourts           911   783065159
## 15 Brad                911   743992700
## 16 Greg                633   717565749
## 17 RT                  480   717163566
## 18 Kay                 284   589685550
## 19 Miles               529   569323888
## 20 Gary                514   530708150
# Share of all transactions handled by the top 20 sellers ranked by total sales value
percent_nsalesbyseller <- round(sum(top_money$num_sales) / nrow(data_cleaned) * 100, digits = 2)
percent_nsalesbyseller
## [1] 70.32
ggplot(sales_by_seller, aes(num_sales, total_sales))+
  geom_point()

The market is dominated by a few key real estate agents, such as “Barry,” “Jellis,” and “Nelson,” each of which handled more than 4,000 transactions and between roughly 4.0 and 5.3 billion in total sales value. The top 20 sellers ranked by total sales value account for approximately 70.32% of all transactions, even though 422 agents appear in the data and the median agent made only 5 sales. This further illustrates the concentration of market power in the hands of a few sellers.
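The concentration figure depends on which ranking defines the "top" sellers. The following is a minimal sketch, on a small made-up data frame, of how the share of transactions can be computed under either ranking. The names `toy_sales` and `top_share()` are hypothetical and are not part of the analysis above.

```r
# Hypothetical toy data: one row per sale, with the agent and the sale price.
toy_sales <- data.frame(
  SellerG = c("A", "A", "A", "B", "B", "C", "D", "D", "D", "D"),
  Price   = c(5e5, 6e5, 7e5, 2e6, 3e6, 4e5, 3e5, 3e5, 4e5, 3e5)
)

# Share (in %) of all transactions handled by the top-n agents,
# ranked either by number of sales or by total sales value.
top_share <- function(df, n, rank_by = c("num_sales", "total_sales")) {
  rank_by <- match.arg(rank_by)
  per_seller <- data.frame(
    num_sales   = as.vector(tapply(df$Price, df$SellerG, length)),
    total_sales = as.vector(tapply(df$Price, df$SellerG, sum))
  )
  ranked <- per_seller[order(per_seller[[rank_by]], decreasing = TRUE), ]
  round(100 * sum(head(ranked$num_sales, n)) / nrow(df), 2)
}

share_by_count <- top_share(toy_sales, n = 2, rank_by = "num_sales")
share_by_value <- top_share(toy_sales, n = 2, rank_by = "total_sales")
```

In the toy data the two rankings pick different agents, so the two shares differ. The same can happen with the real data, which is why it helps to state the ranking alongside the percentage.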

#working with the date variable
data_cleaned$Date <- as.Date(data_cleaned$Date, format = "%d/%m/%Y")
data_cleaned <- data_cleaned %>%
  mutate(Year = format(Date, "%Y"),
         Month = month(Date, label = TRUE, abbr = TRUE),
         day = format(Date, "%d"),
         DayOfWeekAbbrev = wday(Date, label = TRUE))

sales <- data_cleaned %>%
  group_by(Date) %>%
  summarise(sales_n = n(),
            avg_price = mean(Price))

n_salesm <- ggplot(sales, aes(x = Date, y = sales_n))+
  geom_line()+
  theme_minimal() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +  
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

price_flut <- ggplot(sales, aes(x = Date, y = avg_price))+
  geom_line()+
  theme_minimal() +
  scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1))

n_salesm

price_flut

s_year <- data_cleaned %>%
  group_by(Year) %>%
  summarise(sales_n = n(), 
            avg_price = (mean(Price)/1000000))

s_year
## # A tibble: 3 × 3
##   Year  sales_n avg_price
##   <chr>   <int>     <dbl>
## 1 2016    13081     0.966
## 2 2017    20270     1.02 
## 3 2018    15068     0.996

Both the number of transactions and the average price rose from 2016 to 2017 (from 13,081 to 20,270 sales, and from about 0.97 million to 1.02 million), then declined in 2018 (to 15,068 sales and an average of about 1.00 million).
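The size of these year-on-year movements can be made explicit. The sketch below re-enters the values printed in the `s_year` table and computes the percentage change from the previous year. The names `s_year_printed` and `yoy_change()` are hypothetical helpers introduced only for this illustration.

```r
# Values copied from the s_year output above (avg_price is in millions).
s_year_printed <- data.frame(
  Year      = c(2016, 2017, 2018),
  sales_n   = c(13081, 20270, 15068),
  avg_price = c(0.966, 1.02, 0.996)
)

# Percentage change relative to the previous year (NA for the first year).
yoy_change <- function(x) c(NA, 100 * diff(x) / head(x, -1))

s_year_printed$sales_change_pct <- round(yoy_change(s_year_printed$sales_n), 1)
s_year_printed$price_change_pct <- round(yoy_change(s_year_printed$avg_price), 1)
s_year_printed
```

These comparisons assume that each year is covered for a similar span of months. That assumption is worth checking with `range(data_cleaned$Date)` before reading the changes in sales counts as a market trend.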

s_month <- data_cleaned %>%
  group_by(Month) %>%
  summarise(sales_n = n(), 
            avg_price = (mean(Price)/1000000))

by_month <- ggplot(s_month, aes(x = Month, y = sales_n)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  theme_minimal() +
  labs(y = "Number of Sales", title = "Number of Sales by Month")


by_month_pflut <- ggplot(s_month, aes(x = Month, y = avg_price)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  theme_minimal() +
  scale_y_continuous(labels = label_comma(suffix = "M")) +  
  labs(y = "Average Price (in millions)", title = "Price by Month")

s_month
## # A tibble: 12 × 3
##    Month sales_n avg_price
##    <ord>   <int>     <dbl>
##  1 Jan      1281     0.947
##  2 Feb      2552     0.977
##  3 Mar      3835     1.06 
##  4 Apr      4415     0.942
##  5 May      5716     1.02 
##  6 Jun      4679     1.00 
##  7 Jul      3771     0.873
##  8 Aug      4152     1.02 
##  9 Sep      5515     1.03 
## 10 Oct      4243     1.01 
## 11 Nov      4351     1.02 
## 12 Dec      3909     0.990
by_month

by_month_pflut

s_week <-data_cleaned %>%
  group_by(DayOfWeekAbbrev) %>%
  summarise(sales_n = n(),
            avg_price = (mean(Price)/1000000))

by_day <- ggplot(s_week, aes(x = DayOfWeekAbbrev, y = sales_n)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  theme_minimal() +
  labs(y = "Number of Sales", title = "Number of Sales by Week Day")

by_day_pflut <- ggplot(s_week, aes(x = DayOfWeekAbbrev, y = avg_price)) +
  geom_bar(stat = "identity", fill = "skyblue", color = "white") +
  theme_minimal() +
  scale_y_continuous(labels = label_comma(suffix = "M")) +
  labs(y = "Average Price (in millions)", title = "Price by Week Day")

s_week
## # A tibble: 5 × 3
##   DayOfWeekAbbrev sales_n avg_price
##   <ord>             <int>     <dbl>
## 1 Sun                2201     1.01 
## 2 Mon                1894     0.929
## 3 Tue                 254     0.896
## 4 Thu                  99     0.747
## 5 Sat               43971     1.00
by_day

by_day_pflut

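Most property transactions occur on weekends: Saturdays alone account for 43,971 of the 48,419 sales (about 91%). Average prices on Saturdays and Sundays are almost identical (about 1.00 million versus 1.01 million), so the data do not show a meaningful Sunday premium. Weekday averages are lower, but they rest on far fewer sales (only 99 on Thursdays, for example) and should be read with caution.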
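The weekday pattern is easier to judge as shares of all sales than as raw counts. The sketch below re-enters the counts printed in the `s_week` table and converts them to percentages. The name `s_week_printed` is a hypothetical stand-in for the summarised table.

```r
# Counts copied from the s_week output above.
s_week_printed <- data.frame(
  day     = c("Sun", "Mon", "Tue", "Thu", "Sat"),
  sales_n = c(2201, 1894, 254, 99, 43971)
)

# Each day's share of all recorded sales, in percent.
total_sales_n <- sum(s_week_printed$sales_n)
s_week_printed$share_pct <- round(100 * s_week_printed$sales_n / total_sales_n, 1)

# Sort from the busiest day to the quietest.
s_week_printed[order(-s_week_printed$share_pct), ]
```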





4 Regression Analysis

# Treat Rooms as categorical so each room count gets its own coefficient
data_cleaned$Rooms <- as.factor(data_cleaned$Rooms)
# Model 1: all candidate predictors
model1 <- lm(Log_Price ~ Rooms + Distance + Regionname + Type + Propertycount + Date + CouncilArea, data = data_cleaned)
summary(model1)
## 
## Call:
## lm(formula = Log_Price ~ Rooms + Distance + Regionname + Type + 
##     Propertycount + Date + CouncilArea, data = data_cleaned)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -2.5024 -0.1668 -0.0126  0.1488  2.2551 
## 
## Coefficients:
##                                             Estimate Std. Error  t value
## (Intercept)                                1.037e+01  8.101e-02  128.014
## Rooms2                                     5.179e-01  7.027e-03   73.691
## Rooms3                                     7.678e-01  7.490e-03  102.518
## Rooms4                                     9.570e-01  7.883e-03  121.405
## Rooms5                                     1.113e+00  9.283e-03  119.954
## Rooms6                                     1.165e+00  1.718e-02   67.836
## Rooms7                                     1.136e+00  4.388e-02   25.881
## Rooms8                                     1.169e+00  5.996e-02   19.491
## Distance                                  -2.768e-02  4.457e-04  -62.092
## RegionnameEastern Victoria                -8.048e-02  2.432e-02   -3.309
## RegionnameNorthern Metropolitan           -2.161e-01  8.204e-03  -26.336
## RegionnameNorthern Victoria               -1.168e-01  2.764e-02   -4.227
## RegionnameSouth-Eastern Metropolitan      -1.253e-01  1.098e-02  -11.415
## RegionnameSouthern Metropolitan           -1.063e-01  8.675e-03  -12.250
## RegionnameWestern Metropolitan            -1.279e-01  1.476e-02   -8.666
## RegionnameWestern Victoria                -1.448e-01  2.907e-02   -4.981
## Typet                                     -2.078e-01  4.073e-03  -51.008
## Typeu                                     -4.422e-01  4.001e-03 -110.519
## Propertycount                             -1.086e-06  3.271e-07   -3.320
## Date                                       1.728e-04  4.633e-06   37.298
## CouncilAreaBayside City Council            5.613e-01  1.069e-02   52.491
## CouncilAreaBoroondara City Council         4.870e-01  1.022e-02   47.642
## CouncilAreaBrimbank City Council          -2.621e-01  1.491e-02  -17.575
## CouncilAreaCardinia Shire Council          2.946e-01  4.490e-02    6.560
## CouncilAreaCasey City Council              1.383e-01  2.031e-02    6.808
## CouncilAreaDarebin City Council            1.359e-01  8.905e-03   15.261
## CouncilAreaFrankston City Council          3.922e-01  1.810e-02   21.674
## CouncilAreaGlen Eira City Council          3.322e-01  1.068e-02   31.107
## CouncilAreaGreater Dandenong City Council  9.393e-02  1.624e-02    5.785
## CouncilAreaHobsons Bay City Council        4.077e-02  1.650e-02    2.472
## CouncilAreaHume City Council              -2.082e-01  9.217e-03  -22.589
## CouncilAreaKingston City Council           3.893e-01  1.290e-02   30.175
## CouncilAreaKnox City Council               3.211e-02  1.171e-02    2.743
## CouncilAreaMacedon Ranges Shire Council    6.272e-01  3.697e-02   16.964
## CouncilAreaManningham City Council         1.740e-01  8.646e-03   20.120
## CouncilAreaMaribyrnong City Council       -9.622e-02  1.616e-02   -5.954
## CouncilAreaMaroondah City Council          9.084e-02  1.081e-02    8.406
## CouncilAreaMelbourne City Council          2.983e-01  9.912e-03   30.097
## CouncilAreaMelton City Council            -2.991e-01  1.927e-02  -15.523
## CouncilAreaMitchell Shire Council          1.489e-01  5.549e-02    2.683
## CouncilAreaMonash City Council             2.709e-01  8.649e-03   31.318
## CouncilAreaMoonee Valley City Council      1.119e-01  1.568e-02    7.140
## CouncilAreaMoorabool Shire Council         9.516e-02  8.300e-02    1.147
## CouncilAreaMoreland City Council           8.388e-02  8.947e-03    9.375
## CouncilAreaMurrindindi Shire Council       7.262e-01  2.607e-01    2.786
## CouncilAreaNillumbik Shire Council        -3.345e-02  2.893e-02   -1.156
## CouncilAreaPort Phillip City Council       3.296e-01  1.215e-02   27.130
## CouncilAreaStonnington City Council        4.831e-01  1.206e-02   40.048
## CouncilAreaWhitehorse City Council         1.613e-01  9.315e-03   17.320
## CouncilAreaWhittlesea City Council        -1.184e-01  9.396e-03  -12.598
## CouncilAreaWyndham City Council           -3.759e-01  1.609e-02  -23.363
## CouncilAreaYarra City Council              3.341e-01  1.103e-02   30.274
## CouncilAreaYarra Ranges Shire Council      1.362e-01  3.062e-02    4.449
##                                           Pr(>|t|)    
## (Intercept)                                < 2e-16 ***
## Rooms2                                     < 2e-16 ***
## Rooms3                                     < 2e-16 ***
## Rooms4                                     < 2e-16 ***
## Rooms5                                     < 2e-16 ***
## Rooms6                                     < 2e-16 ***
## Rooms7                                     < 2e-16 ***
## Rooms8                                     < 2e-16 ***
## Distance                                   < 2e-16 ***
## RegionnameEastern Victoria                0.000938 ***
## RegionnameNorthern Metropolitan            < 2e-16 ***
## RegionnameNorthern Victoria               2.37e-05 ***
## RegionnameSouth-Eastern Metropolitan       < 2e-16 ***
## RegionnameSouthern Metropolitan            < 2e-16 ***
## RegionnameWestern Metropolitan             < 2e-16 ***
## RegionnameWestern Victoria                6.34e-07 ***
## Typet                                      < 2e-16 ***
## Typeu                                      < 2e-16 ***
## Propertycount                             0.000901 ***
## Date                                       < 2e-16 ***
## CouncilAreaBayside City Council            < 2e-16 ***
## CouncilAreaBoroondara City Council         < 2e-16 ***
## CouncilAreaBrimbank City Council           < 2e-16 ***
## CouncilAreaCardinia Shire Council         5.43e-11 ***
## CouncilAreaCasey City Council             9.98e-12 ***
## CouncilAreaDarebin City Council            < 2e-16 ***
## CouncilAreaFrankston City Council          < 2e-16 ***
## CouncilAreaGlen Eira City Council          < 2e-16 ***
## CouncilAreaGreater Dandenong City Council 7.28e-09 ***
## CouncilAreaHobsons Bay City Council       0.013455 *  
## CouncilAreaHume City Council               < 2e-16 ***
## CouncilAreaKingston City Council           < 2e-16 ***
## CouncilAreaKnox City Council              0.006097 ** 
## CouncilAreaMacedon Ranges Shire Council    < 2e-16 ***
## CouncilAreaManningham City Council         < 2e-16 ***
## CouncilAreaMaribyrnong City Council       2.63e-09 ***
## CouncilAreaMaroondah City Council          < 2e-16 ***
## CouncilAreaMelbourne City Council          < 2e-16 ***
## CouncilAreaMelton City Council             < 2e-16 ***
## CouncilAreaMitchell Shire Council         0.007295 ** 
## CouncilAreaMonash City Council             < 2e-16 ***
## CouncilAreaMoonee Valley City Council     9.47e-13 ***
## CouncilAreaMoorabool Shire Council        0.251576    
## CouncilAreaMoreland City Council           < 2e-16 ***
## CouncilAreaMurrindindi Shire Council      0.005343 ** 
## CouncilAreaNillumbik Shire Council        0.247577    
## CouncilAreaPort Phillip City Council       < 2e-16 ***
## CouncilAreaStonnington City Council        < 2e-16 ***
## CouncilAreaWhitehorse City Council         < 2e-16 ***
## CouncilAreaWhittlesea City Council         < 2e-16 ***
## CouncilAreaWyndham City Council            < 2e-16 ***
## CouncilAreaYarra City Council              < 2e-16 ***
## CouncilAreaYarra Ranges Shire Council     8.64e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2592 on 48366 degrees of freedom
## Multiple R-squared:  0.7281, Adjusted R-squared:  0.7279 
## F-statistic:  2491 on 52 and 48366 DF,  p-value: < 2.2e-16
library(modelr)
## 
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
## 
##     bootstrap
# In-sample fit statistics on the log-price scale
data.frame(R2 = rsquare(model1, data = data_cleaned),
  RMSE = rmse(model1, data = data_cleaned),
  MAE = mae(model1, data = data_cleaned))
##          R2      RMSE       MAE
## 1 0.7281469 0.2590646 0.1975971
glance(model1)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.728         0.728 0.259     2491.       0    52 -3305. 6718. 7193.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model 2: Model 1 without Date
model2 <- lm(Log_Price ~ Rooms + Distance + Regionname + Type + Propertycount + CouncilArea, data = data_cleaned)
summary(model2)
## 
## Call:
## lm(formula = Log_Price ~ Rooms + Distance + Regionname + Type + 
##     Propertycount + CouncilArea, data = data_cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.53795 -0.16902 -0.01025  0.15262  2.26022 
## 
## Coefficients:
##                                             Estimate Std. Error  t value
## (Intercept)                                1.336e+01  1.100e-02 1215.118
## Rooms2                                     5.201e-01  7.127e-03   72.971
## Rooms3                                     7.701e-01  7.596e-03  101.374
## Rooms4                                     9.600e-01  7.995e-03  120.076
## Rooms5                                     1.115e+00  9.415e-03  118.475
## Rooms6                                     1.169e+00  1.742e-02   67.090
## Rooms7                                     1.132e+00  4.451e-02   25.443
## Rooms8                                     1.164e+00  6.081e-02   19.142
## Distance                                  -2.740e-02  4.520e-04  -60.619
## RegionnameEastern Victoria                -7.238e-02  2.467e-02   -2.934
## RegionnameNorthern Metropolitan           -2.143e-01  8.321e-03  -25.758
## RegionnameNorthern Victoria               -1.056e-01  2.804e-02   -3.768
## RegionnameSouth-Eastern Metropolitan      -1.263e-01  1.114e-02  -11.340
## RegionnameSouthern Metropolitan           -1.059e-01  8.799e-03  -12.034
## RegionnameWestern Metropolitan            -1.225e-01  1.497e-02   -8.186
## RegionnameWestern Victoria                -1.340e-01  2.949e-02   -4.543
## Typet                                     -2.039e-01  4.130e-03  -49.377
## Typeu                                     -4.400e-01  4.058e-03 -108.427
## Propertycount                             -1.136e-06  3.317e-07   -3.425
## CouncilAreaBayside City Council            5.581e-01  1.085e-02   51.455
## CouncilAreaBoroondara City Council         4.888e-01  1.037e-02   47.155
## CouncilAreaBrimbank City Council          -2.657e-01  1.512e-02  -17.569
## CouncilAreaCardinia Shire Council          2.830e-01  4.554e-02    6.214
## CouncilAreaCasey City Council              1.354e-01  2.060e-02    6.573
## CouncilAreaDarebin City Council            1.411e-01  9.031e-03   15.628
## CouncilAreaFrankston City Council          3.855e-01  1.835e-02   21.006
## CouncilAreaGlen Eira City Council          3.324e-01  1.083e-02   30.683
## CouncilAreaGreater Dandenong City Council  9.235e-02  1.647e-02    5.608
## CouncilAreaHobsons Bay City Council        3.640e-02  1.673e-02    2.176
## CouncilAreaHume City Council              -1.999e-01  9.346e-03  -21.391
## CouncilAreaKingston City Council           3.880e-01  1.309e-02   29.651
## CouncilAreaKnox City Council               3.748e-02  1.187e-02    3.157
## CouncilAreaMacedon Ranges Shire Council    6.194e-01  3.750e-02   16.516
## CouncilAreaManningham City Council         1.703e-01  8.769e-03   19.425
## CouncilAreaMaribyrnong City Council       -9.698e-02  1.639e-02   -5.917
## CouncilAreaMaroondah City Council          8.949e-02  1.096e-02    8.165
## CouncilAreaMelbourne City Council          2.991e-01  1.005e-02   29.745
## CouncilAreaMelton City Council            -3.000e-01  1.954e-02  -15.352
## CouncilAreaMitchell Shire Council          1.549e-01  5.628e-02    2.752
## CouncilAreaMonash City Council             2.680e-01  8.772e-03   30.549
## CouncilAreaMoonee Valley City Council      1.095e-01  1.590e-02    6.885
## CouncilAreaMoorabool Shire Council         1.133e-01  8.418e-02    1.346
## CouncilAreaMoreland City Council           8.566e-02  9.075e-03    9.439
## CouncilAreaMurrindindi Shire Council       7.644e-01  2.644e-01    2.891
## CouncilAreaNillumbik Shire Council        -2.884e-02  2.935e-02   -0.983
## CouncilAreaPort Phillip City Council       3.318e-01  1.232e-02   26.930
## CouncilAreaStonnington City Council        4.854e-01  1.223e-02   39.673
## CouncilAreaWhitehorse City Council         1.612e-01  9.448e-03   17.057
## CouncilAreaWhittlesea City Council        -1.132e-01  9.529e-03  -11.879
## CouncilAreaWyndham City Council           -3.644e-01  1.632e-02  -22.333
## CouncilAreaYarra City Council              3.387e-01  1.119e-02   30.261
## CouncilAreaYarra Ranges Shire Council      1.333e-01  3.106e-02    4.292
##                                           Pr(>|t|)    
## (Intercept)                                < 2e-16 ***
## Rooms2                                     < 2e-16 ***
## Rooms3                                     < 2e-16 ***
## Rooms4                                     < 2e-16 ***
## Rooms5                                     < 2e-16 ***
## Rooms6                                     < 2e-16 ***
## Rooms7                                     < 2e-16 ***
## Rooms8                                     < 2e-16 ***
## Distance                                   < 2e-16 ***
## RegionnameEastern Victoria                0.003345 ** 
## RegionnameNorthern Metropolitan            < 2e-16 ***
## RegionnameNorthern Victoria               0.000165 ***
## RegionnameSouth-Eastern Metropolitan       < 2e-16 ***
## RegionnameSouthern Metropolitan            < 2e-16 ***
## RegionnameWestern Metropolitan            2.76e-16 ***
## RegionnameWestern Victoria                5.55e-06 ***
## Typet                                      < 2e-16 ***
## Typeu                                      < 2e-16 ***
## Propertycount                             0.000616 ***
## CouncilAreaBayside City Council            < 2e-16 ***
## CouncilAreaBoroondara City Council         < 2e-16 ***
## CouncilAreaBrimbank City Council           < 2e-16 ***
## CouncilAreaCardinia Shire Council         5.21e-10 ***
## CouncilAreaCasey City Council             4.97e-11 ***
## CouncilAreaDarebin City Council            < 2e-16 ***
## CouncilAreaFrankston City Council          < 2e-16 ***
## CouncilAreaGlen Eira City Council          < 2e-16 ***
## CouncilAreaGreater Dandenong City Council 2.06e-08 ***
## CouncilAreaHobsons Bay City Council       0.029586 *  
## CouncilAreaHume City Council               < 2e-16 ***
## CouncilAreaKingston City Council           < 2e-16 ***
## CouncilAreaKnox City Council              0.001597 ** 
## CouncilAreaMacedon Ranges Shire Council    < 2e-16 ***
## CouncilAreaManningham City Council         < 2e-16 ***
## CouncilAreaMaribyrnong City Council       3.30e-09 ***
## CouncilAreaMaroondah City Council         3.28e-16 ***
## CouncilAreaMelbourne City Council          < 2e-16 ***
## CouncilAreaMelton City Council             < 2e-16 ***
## CouncilAreaMitchell Shire Council         0.005934 ** 
## CouncilAreaMonash City Council             < 2e-16 ***
## CouncilAreaMoonee Valley City Council     5.86e-12 ***
## CouncilAreaMoorabool Shire Council        0.178208    
## CouncilAreaMoreland City Council           < 2e-16 ***
## CouncilAreaMurrindindi Shire Council      0.003844 ** 
## CouncilAreaNillumbik Shire Council        0.325664    
## CouncilAreaPort Phillip City Council       < 2e-16 ***
## CouncilAreaStonnington City Council        < 2e-16 ***
## CouncilAreaWhitehorse City Council         < 2e-16 ***
## CouncilAreaWhittlesea City Council         < 2e-16 ***
## CouncilAreaWyndham City Council            < 2e-16 ***
## CouncilAreaYarra City Council              < 2e-16 ***
## CouncilAreaYarra Ranges Shire Council     1.77e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2629 on 48367 degrees of freedom
## Multiple R-squared:  0.7203, Adjusted R-squared:   0.72 
## F-statistic:  2443 on 51 and 48367 DF,  p-value: < 2.2e-16
data.frame(R2 = rsquare(model2, data = data_cleaned),
  RMSE = rmse(model2, data = data_cleaned),
  MAE = mae(model2, data = data_cleaned))
##          R2      RMSE       MAE
## 1 0.7203278 0.2627639 0.2006794
glance(model2)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.720         0.720 0.263     2443.       0    51 -3992. 8089. 8555.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model 3: Model 1 without Propertycount
model3 <- lm(Log_Price ~ Rooms + Distance + Type + Regionname + Date + CouncilArea, data = data_cleaned)
summary(model3)
## 
## Call:
## lm(formula = Log_Price ~ Rooms + Distance + Type + Regionname + 
##     Date + CouncilArea, data = data_cleaned)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49827 -0.16693 -0.01301  0.14887  2.25624 
## 
## Coefficients:
##                                             Estimate Std. Error  t value
## (Intercept)                                1.037e+01  8.100e-02  127.964
## Rooms2                                     5.185e-01  7.025e-03   73.812
## Rooms3                                     7.683e-01  7.489e-03  102.596
## Rooms4                                     9.577e-01  7.881e-03  121.509
## Rooms5                                     1.114e+00  9.283e-03  119.977
## Rooms6                                     1.165e+00  1.718e-02   67.837
## Rooms7                                     1.136e+00  4.388e-02   25.887
## Rooms8                                     1.170e+00  5.996e-02   19.507
## Distance                                  -2.792e-02  4.398e-04  -63.476
## Typet                                     -2.079e-01  4.074e-03  -51.043
## Typeu                                     -4.430e-01  3.995e-03 -110.895
## RegionnameEastern Victoria                -7.959e-02  2.432e-02   -3.272
## RegionnameNorthern Metropolitan           -2.154e-01  8.202e-03  -26.256
## RegionnameNorthern Victoria               -1.135e-01  2.763e-02   -4.107
## RegionnameSouth-Eastern Metropolitan      -1.222e-01  1.094e-02  -11.173
## RegionnameSouthern Metropolitan           -1.045e-01  8.660e-03  -12.068
## RegionnameWestern Metropolitan            -1.263e-01  1.475e-02   -8.562
## RegionnameWestern Victoria                -1.397e-01  2.903e-02   -4.810
## Date                                       1.729e-04  4.633e-06   37.308
## CouncilAreaBayside City Council            5.583e-01  1.066e-02   52.392
## CouncilAreaBoroondara City Council         4.816e-01  1.009e-02   47.710
## CouncilAreaBrimbank City Council          -2.633e-01  1.491e-02  -17.662
## CouncilAreaCardinia Shire Council          2.951e-01  4.491e-02    6.571
## CouncilAreaCasey City Council              1.363e-01  2.031e-02    6.714
## CouncilAreaDarebin City Council            1.252e-01  8.304e-03   15.079
## CouncilAreaFrankston City Council          3.884e-01  1.806e-02   21.505
## CouncilAreaGlen Eira City Council          3.284e-01  1.062e-02   30.926
## CouncilAreaGreater Dandenong City Council  8.932e-02  1.618e-02    5.521
## CouncilAreaHobsons Bay City Council        3.853e-02  1.648e-02    2.338
## CouncilAreaHume City Council              -2.104e-01  9.194e-03  -22.886
## CouncilAreaKingston City Council           3.901e-01  1.290e-02   30.235
## CouncilAreaKnox City Council               3.145e-02  1.171e-02    2.687
## CouncilAreaMacedon Ranges Shire Council    6.349e-01  3.690e-02   17.204
## CouncilAreaManningham City Council         1.725e-01  8.637e-03   19.978
## CouncilAreaMaribyrnong City Council       -9.914e-02  1.614e-02   -6.143
## CouncilAreaMaroondah City Council          9.162e-02  1.080e-02    8.480
## CouncilAreaMelbourne City Council          2.912e-01  9.674e-03   30.095
## CouncilAreaMelton City Council            -2.987e-01  1.927e-02  -15.502
## CouncilAreaMitchell Shire Council          1.546e-01  5.547e-02    2.787
## CouncilAreaMonash City Council             2.661e-01  8.531e-03   31.195
## CouncilAreaMoonee Valley City Council      1.092e-01  1.566e-02    6.976
## CouncilAreaMoorabool Shire Council         9.905e-02  8.300e-02    1.193
## CouncilAreaMoreland City Council           7.947e-02  8.849e-03    8.981
## CouncilAreaMurrindindi Shire Council       7.364e-01  2.607e-01    2.824
## CouncilAreaNillumbik Shire Council        -3.329e-02  2.894e-02   -1.151
## CouncilAreaPort Phillip City Council       3.234e-01  1.200e-02   26.938
## CouncilAreaStonnington City Council        4.782e-01  1.197e-02   39.937
## CouncilAreaWhitehorse City Council         1.622e-01  9.312e-03   17.423
## CouncilAreaWhittlesea City Council        -1.210e-01  9.363e-03  -12.927
## CouncilAreaWyndham City Council           -3.849e-01  1.586e-02  -24.266
## CouncilAreaYarra City Council              3.269e-01  1.082e-02   30.206
## CouncilAreaYarra Ranges Shire Council      1.386e-01  3.061e-02    4.528
##                                           Pr(>|t|)    
## (Intercept)                                < 2e-16 ***
## Rooms2                                     < 2e-16 ***
## Rooms3                                     < 2e-16 ***
## Rooms4                                     < 2e-16 ***
## Rooms5                                     < 2e-16 ***
## Rooms6                                     < 2e-16 ***
## Rooms7                                     < 2e-16 ***
## Rooms8                                     < 2e-16 ***
## Distance                                   < 2e-16 ***
## Typet                                      < 2e-16 ***
## Typeu                                      < 2e-16 ***
## RegionnameEastern Victoria                 0.00107 ** 
## RegionnameNorthern Metropolitan            < 2e-16 ***
## RegionnameNorthern Victoria               4.01e-05 ***
## RegionnameSouth-Eastern Metropolitan       < 2e-16 ***
## RegionnameSouthern Metropolitan            < 2e-16 ***
## RegionnameWestern Metropolitan             < 2e-16 ***
## RegionnameWestern Victoria                1.51e-06 ***
## Date                                       < 2e-16 ***
## CouncilAreaBayside City Council            < 2e-16 ***
## CouncilAreaBoroondara City Council         < 2e-16 ***
## CouncilAreaBrimbank City Council           < 2e-16 ***
## CouncilAreaCardinia Shire Council         5.04e-11 ***
## CouncilAreaCasey City Council             1.91e-11 ***
## CouncilAreaDarebin City Council            < 2e-16 ***
## CouncilAreaFrankston City Council          < 2e-16 ***
## CouncilAreaGlen Eira City Council          < 2e-16 ***
## CouncilAreaGreater Dandenong City Council 3.39e-08 ***
## CouncilAreaHobsons Bay City Council        0.01941 *  
## CouncilAreaHume City Council               < 2e-16 ***
## CouncilAreaKingston City Council           < 2e-16 ***
## CouncilAreaKnox City Council               0.00722 ** 
## CouncilAreaMacedon Ranges Shire Council    < 2e-16 ***
## CouncilAreaManningham City Council         < 2e-16 ***
## CouncilAreaMaribyrnong City Council       8.15e-10 ***
## CouncilAreaMaroondah City Council          < 2e-16 ***
## CouncilAreaMelbourne City Council          < 2e-16 ***
## CouncilAreaMelton City Council             < 2e-16 ***
## CouncilAreaMitchell Shire Council          0.00532 ** 
## CouncilAreaMonash City Council             < 2e-16 ***
## CouncilAreaMoonee Valley City Council     3.07e-12 ***
## CouncilAreaMoorabool Shire Council         0.23271    
## CouncilAreaMoreland City Council           < 2e-16 ***
## CouncilAreaMurrindindi Shire Council       0.00474 ** 
## CouncilAreaNillumbik Shire Council         0.24990    
## CouncilAreaPort Phillip City Council       < 2e-16 ***
## CouncilAreaStonnington City Council        < 2e-16 ***
## CouncilAreaWhitehorse City Council         < 2e-16 ***
## CouncilAreaWhittlesea City Council         < 2e-16 ***
## CouncilAreaWyndham City Council            < 2e-16 ***
## CouncilAreaYarra City Council              < 2e-16 ***
## CouncilAreaYarra Ranges Shire Council     5.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2592 on 48367 degrees of freedom
## Multiple R-squared:  0.7281, Adjusted R-squared:  0.7278 
## F-statistic:  2539 on 51 and 48367 DF,  p-value: < 2.2e-16
data.frame(R2 = rsquare(model3, data = data_cleaned),
  RMSE = rmse(model3, data = data_cleaned),
  MAE = mae(model3, data = data_cleaned))
##          R2      RMSE       MAE
## 1 0.7280849 0.2590942 0.1976344
glance(model3)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.728         0.728 0.259     2539.       0    51 -3311. 6727. 7193.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>

To analyze the factors influencing property prices in the Melbourne housing market, three different regression models were compared. These models varied in the selection and combination of independent variables, including Rooms, Distance, Regionname, Type, Propertycount, Date, and CouncilArea.

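The models were compared on R-squared, RMSE (Root Mean Squared Error), and MAE (Mean Absolute Error), all computed in-sample on the log-price scale. Dropping Date (Model 2) lowers R-squared from 0.728 to 0.720 and raises RMSE from 0.259 to 0.263. Dropping Propertycount (Model 3) leaves the fit essentially unchanged relative to Model 1: R-squared is 0.728 and RMSE is 0.259 in both, and BIC is identical at 7193, although AIC is slightly higher (6727 versus 6718). Model 3 was therefore selected as the best balance of predictive accuracy and parsimony. Its R-squared of 0.728 indicates that approximately 72.8% of the variance in the log-transformed property prices (Log_Price) is explained by the model.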
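A side-by-side table makes this kind of comparison easier to read than three separate summaries. The sketch below uses only base R and the built-in `mtcars` data as a stand-in, since the housing data are not reproduced here. The helper `compare_models()` and the objects `fit_full` and `fit_reduced` are hypothetical names for illustration.

```r
# Collect fit statistics for several lm objects into one table.
compare_models <- function(...) {
  models <- list(...)
  data.frame(
    model  = names(models),
    adj_r2 = sapply(models, function(m) summary(m)$adj.r.squared),
    AIC    = sapply(models, AIC),
    BIC    = sapply(models, BIC),
    row.names = NULL
  )
}

# Stand-in models on the built-in mtcars data (not the housing data).
fit_full    <- lm(log(mpg) ~ wt + hp + qsec, data = mtcars)
fit_reduced <- lm(log(mpg) ~ wt + hp, data = mtcars)

comparison <- compare_models(full = fit_full, reduced = fit_reduced)
comparison

# F-test for the nested pair: does the extra term improve the fit?
nested_test <- anova(fit_reduced, fit_full)
nested_test
```

With the housing models, the analogous calls would be `compare_models(model1 = model1, model2 = model2, model3 = model3)` and `anova(model3, model1)`, because Model 3 is nested in Model 1 and differs only by Propertycount. All of these statistics are in-sample, so a hold-out set would give a better estimate of predictive accuracy.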



Key Findings from the Regression Model

-Number of Rooms (Rooms): The coefficients for Rooms were positive and significant, indicating that the log-transformed property price rises with the number of rooms, which aligns with market expectations. Rooms enters the model as a categorical variable, with one-room properties as the reference level. For instance, properties with 4 rooms had a coefficient of 0.957 on the log scale; since exp(0.957) ≈ 2.60, these properties are priced roughly 160% higher than one-room properties, all else being equal.



-Distance from City Center (Distance): There was a negative and significant relationship between Distance and Log_Price. The coefficient of -0.0277 means that each additional kilometer from the city center lowers Log_Price by 0.0277, which corresponds to a price about 2.7% lower, holding other factors constant.



-Region and Council Area: The model included categorical variables for Regionname and CouncilArea, with significant differences observed across their levels. Each coefficient is measured against the reference level of its variable, not against all other areas. For example, properties in the Bayside City Council area had a coefficient of 0.561, indicating prices roughly 75% higher (exp(0.561) ≈ 1.75) than in the reference council area, all else being equal. Conversely, regions like Western Metropolitan and Western Victoria showed negative coefficients, reflecting lower property prices relative to the reference region.



-Property Type (Type): Property type was also a significant predictor, with houses (h), the reference category, generally priced higher than units (u) and townhouses (t). For example, the coefficient for Typeu (units) was -0.442, suggesting that units sell for about 36% less than comparable houses (exp(-0.442) ≈ 0.64).



-Property Count (Propertycount): Propertycount is not among the predictors of Model 3, whose terms are Rooms, Distance, Type, Regionname, Date, and CouncilArea. In the candidate models that included it, its effect was statistically significant but the coefficient was small. This suggests that while the number of properties in a suburb may have some influence on individual property prices, it is not a major determinant.
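
Because the response is the natural log of price, a coefficient is easier to read once converted to a percentage change in price. The sketch below applies that conversion to the coefficients quoted in the findings above; the vector and its element names are hypothetical labels chosen for illustration.

```r
# Coefficients on the log-price scale, copied from the findings above.
log_coefs <- c(Rooms4 = 0.957, Distance = -0.0277,
               BaysideCouncil = 0.561, Typeu = -0.442)

# A coefficient b on log(price) multiplies price by exp(b),
# i.e. a percentage change of 100 * (exp(b) - 1).
pct_change <- 100 * (exp(log_coefs) - 1)
print(round(pct_change, 1))
```

For small coefficients such as Distance, the percentage change is close to 100 times the coefficient; for large ones such as Rooms4, the two diverge substantially, so the exponential form should be used.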

plot(model3)

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

vif(model3)
##                     GVIF Df GVIF^(1/(2*Df))
## Rooms       1.968611e+00  7        1.049570
## Distance    7.944009e+00  1        2.818512
## Type        1.817605e+00  2        1.161114
## Regionname  6.345246e+04  7        2.203089
## Date        1.012013e+00  1        1.005989
## CouncilArea 2.615808e+05 33        1.208050
ks_test <- ks.test(model3$residuals, "pnorm", mean = mean(model3$residuals), sd = sd(model3$residuals))
## Warning in ks.test.default(model3$residuals, "pnorm", mean =
## mean(model3$residuals), : ties should not be present for the Kolmogorov-Smirnov
## test
ks_test
## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  model3$residuals
## D = 0.036032, p-value < 2.2e-16
## alternative hypothesis: two-sided
plot(model3, 4)

plot(model3, 5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced

## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced



Assumption Checking



-Linearity: The relationships between the predictors and the response variable were examined and found to be sufficiently linear for the purposes of this analysis.

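-Normality of Residuals: The Kolmogorov-Smirnov test rejected normality of the residuals (D = 0.036, p-value < 2.2e-16). With roughly 48,000 observations, however, the test flags even small departures: D = 0.036 means the empirical and fitted normal distribution functions differ by at most 3.6 percentage points. Because the sample is large, the Central Limit Theorem supports approximately normal sampling distributions for the coefficient estimates, so the impact on the overall model interpretation was considered minimal.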



-Homoscedasticity: The plot of residuals versus fitted values showed no clear pattern, suggesting that the assumption of homoscedasticity (constant variance of residuals) was reasonably met.

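-Multicollinearity: Generalized Variance Inflation Factors (GVIFs) were calculated to assess multicollinearity among the predictors. Some multicollinearity was present, as expected given that Distance, Regionname, and CouncilArea all encode location. The adjusted values GVIF^(1/(2*Df)) were highest for Distance (2.82) and Regionname (2.20); all others were below 1.25. These levels were judged acceptable and did not significantly undermine the model’s stability.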
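
The adjusted GVIF column can be reproduced directly from the raw GVIF and degrees of freedom. The sketch below does so with the values printed by vif(model3); the data frame and the column name vif_equivalent are hypothetical, and the cutoff of 10 is only one common rule of thumb, not a threshold stated in the analysis.

```r
# GVIF and degrees of freedom copied from the vif(model3) output above.
gvif <- data.frame(
  term = c("Rooms", "Distance", "Type", "Regionname", "Date", "CouncilArea"),
  GVIF = c(1.968611, 7.944009, 1.817605, 6.345246e4, 1.012013, 2.615808e5),
  Df = c(7, 1, 2, 7, 1, 33)
)

# GVIF^(1/(2*Df)) puts terms with different Df on a common scale;
# squaring it gives a value comparable to an ordinary VIF.
gvif$adjusted <- gvif$GVIF^(1 / (2 * gvif$Df))
gvif$vif_equivalent <- gvif$adjusted^2

# Sort so the most collinear terms appear first.
print(gvif[order(-gvif$adjusted), ])
```

The raw GVIF for Regionname and CouncilArea looks alarming only because those terms span many dummy variables; after the adjustment, every term falls below a VIF-equivalent of 10.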

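# Test for outlier removal
# head() keeps only the six rows with the largest Cook's distance,
# so nrow(a) below is 6 and the cutoff is 4/6 rather than 4/n.
a <- augment(model3) %>%
  arrange(desc(.cooksd)) %>%
  head()

influential_points <- a %>%
  filter(.cooksd > 4 / nrow(a))
# Rows are matched by row name; augment() returns a tibble, so these row
# names are positions within `a`, not the original rows of data_cleaned.
data_cleaned_no_outliers <- data_cleaned %>%
  filter(!(rownames(data_cleaned) %in% rownames(influential_points)))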

model3_no_outliers <- lm(Log_Price ~ Rooms + Distance + Type + Regionname + Date + CouncilArea, data = data_cleaned_no_outliers)


summary(model3_no_outliers)
## 
## Call:
## lm(formula = Log_Price ~ Rooms + Distance + Type + Regionname + 
##     Date + CouncilArea, data = data_cleaned_no_outliers)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.49827 -0.16693 -0.01301  0.14887  2.25624 
## 
## Coefficients:
##                                             Estimate Std. Error  t value
## (Intercept)                                1.037e+01  8.100e-02  127.964
## Rooms2                                     5.185e-01  7.025e-03   73.812
## Rooms3                                     7.683e-01  7.489e-03  102.596
## Rooms4                                     9.577e-01  7.881e-03  121.509
## Rooms5                                     1.114e+00  9.283e-03  119.977
## Rooms6                                     1.165e+00  1.718e-02   67.837
## Rooms7                                     1.136e+00  4.388e-02   25.887
## Rooms8                                     1.170e+00  5.996e-02   19.507
## Distance                                  -2.792e-02  4.398e-04  -63.476
## Typet                                     -2.079e-01  4.074e-03  -51.043
## Typeu                                     -4.430e-01  3.995e-03 -110.895
## RegionnameEastern Victoria                -7.959e-02  2.432e-02   -3.272
## RegionnameNorthern Metropolitan           -2.154e-01  8.202e-03  -26.256
## RegionnameNorthern Victoria               -1.135e-01  2.763e-02   -4.107
## RegionnameSouth-Eastern Metropolitan      -1.222e-01  1.094e-02  -11.173
## RegionnameSouthern Metropolitan           -1.045e-01  8.660e-03  -12.068
## RegionnameWestern Metropolitan            -1.263e-01  1.475e-02   -8.562
## RegionnameWestern Victoria                -1.397e-01  2.903e-02   -4.810
## Date                                       1.729e-04  4.633e-06   37.308
## CouncilAreaBayside City Council            5.583e-01  1.066e-02   52.392
## CouncilAreaBoroondara City Council         4.816e-01  1.009e-02   47.710
## CouncilAreaBrimbank City Council          -2.633e-01  1.491e-02  -17.662
## CouncilAreaCardinia Shire Council          2.951e-01  4.491e-02    6.571
## CouncilAreaCasey City Council              1.363e-01  2.031e-02    6.714
## CouncilAreaDarebin City Council            1.252e-01  8.304e-03   15.079
## CouncilAreaFrankston City Council          3.884e-01  1.806e-02   21.505
## CouncilAreaGlen Eira City Council          3.284e-01  1.062e-02   30.926
## CouncilAreaGreater Dandenong City Council  8.932e-02  1.618e-02    5.521
## CouncilAreaHobsons Bay City Council        3.853e-02  1.648e-02    2.338
## CouncilAreaHume City Council              -2.104e-01  9.194e-03  -22.886
## CouncilAreaKingston City Council           3.901e-01  1.290e-02   30.235
## CouncilAreaKnox City Council               3.145e-02  1.171e-02    2.687
## CouncilAreaMacedon Ranges Shire Council    6.349e-01  3.690e-02   17.204
## CouncilAreaManningham City Council         1.725e-01  8.637e-03   19.978
## CouncilAreaMaribyrnong City Council       -9.914e-02  1.614e-02   -6.143
## CouncilAreaMaroondah City Council          9.162e-02  1.080e-02    8.480
## CouncilAreaMelbourne City Council          2.912e-01  9.674e-03   30.095
## CouncilAreaMelton City Council            -2.987e-01  1.927e-02  -15.502
## CouncilAreaMitchell Shire Council          1.546e-01  5.547e-02    2.787
## CouncilAreaMonash City Council             2.661e-01  8.531e-03   31.195
## CouncilAreaMoonee Valley City Council      1.092e-01  1.566e-02    6.976
## CouncilAreaMoorabool Shire Council         9.905e-02  8.300e-02    1.193
## CouncilAreaMoreland City Council           7.947e-02  8.849e-03    8.981
## CouncilAreaMurrindindi Shire Council       7.364e-01  2.607e-01    2.824
## CouncilAreaNillumbik Shire Council        -3.329e-02  2.894e-02   -1.151
## CouncilAreaPort Phillip City Council       3.234e-01  1.200e-02   26.938
## CouncilAreaStonnington City Council        4.782e-01  1.197e-02   39.937
## CouncilAreaWhitehorse City Council         1.622e-01  9.312e-03   17.423
## CouncilAreaWhittlesea City Council        -1.210e-01  9.363e-03  -12.927
## CouncilAreaWyndham City Council           -3.849e-01  1.586e-02  -24.266
## CouncilAreaYarra City Council              3.269e-01  1.082e-02   30.206
## CouncilAreaYarra Ranges Shire Council      1.386e-01  3.061e-02    4.528
##                                           Pr(>|t|)    
## (Intercept)                                < 2e-16 ***
## Rooms2                                     < 2e-16 ***
## Rooms3                                     < 2e-16 ***
## Rooms4                                     < 2e-16 ***
## Rooms5                                     < 2e-16 ***
## Rooms6                                     < 2e-16 ***
## Rooms7                                     < 2e-16 ***
## Rooms8                                     < 2e-16 ***
## Distance                                   < 2e-16 ***
## Typet                                      < 2e-16 ***
## Typeu                                      < 2e-16 ***
## RegionnameEastern Victoria                 0.00107 ** 
## RegionnameNorthern Metropolitan            < 2e-16 ***
## RegionnameNorthern Victoria               4.01e-05 ***
## RegionnameSouth-Eastern Metropolitan       < 2e-16 ***
## RegionnameSouthern Metropolitan            < 2e-16 ***
## RegionnameWestern Metropolitan             < 2e-16 ***
## RegionnameWestern Victoria                1.51e-06 ***
## Date                                       < 2e-16 ***
## CouncilAreaBayside City Council            < 2e-16 ***
## CouncilAreaBoroondara City Council         < 2e-16 ***
## CouncilAreaBrimbank City Council           < 2e-16 ***
## CouncilAreaCardinia Shire Council         5.04e-11 ***
## CouncilAreaCasey City Council             1.91e-11 ***
## CouncilAreaDarebin City Council            < 2e-16 ***
## CouncilAreaFrankston City Council          < 2e-16 ***
## CouncilAreaGlen Eira City Council          < 2e-16 ***
## CouncilAreaGreater Dandenong City Council 3.39e-08 ***
## CouncilAreaHobsons Bay City Council        0.01941 *  
## CouncilAreaHume City Council               < 2e-16 ***
## CouncilAreaKingston City Council           < 2e-16 ***
## CouncilAreaKnox City Council               0.00722 ** 
## CouncilAreaMacedon Ranges Shire Council    < 2e-16 ***
## CouncilAreaManningham City Council         < 2e-16 ***
## CouncilAreaMaribyrnong City Council       8.15e-10 ***
## CouncilAreaMaroondah City Council          < 2e-16 ***
## CouncilAreaMelbourne City Council          < 2e-16 ***
## CouncilAreaMelton City Council             < 2e-16 ***
## CouncilAreaMitchell Shire Council          0.00532 ** 
## CouncilAreaMonash City Council             < 2e-16 ***
## CouncilAreaMoonee Valley City Council     3.07e-12 ***
## CouncilAreaMoorabool Shire Council         0.23271    
## CouncilAreaMoreland City Council           < 2e-16 ***
## CouncilAreaMurrindindi Shire Council       0.00474 ** 
## CouncilAreaNillumbik Shire Council         0.24990    
## CouncilAreaPort Phillip City Council       < 2e-16 ***
## CouncilAreaStonnington City Council        < 2e-16 ***
## CouncilAreaWhitehorse City Council         < 2e-16 ***
## CouncilAreaWhittlesea City Council         < 2e-16 ***
## CouncilAreaWyndham City Council            < 2e-16 ***
## CouncilAreaYarra City Council              < 2e-16 ***
## CouncilAreaYarra Ranges Shire Council     5.98e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.2592 on 48367 degrees of freedom
## Multiple R-squared:  0.7281, Adjusted R-squared:  0.7278 
## F-statistic:  2539 on 51 and 48367 DF,  p-value: < 2.2e-16
data.frame(R2 = rsquare(model3_no_outliers, data = data_cleaned_no_outliers),
  RMSE = rmse(model3_no_outliers, data = data_cleaned_no_outliers),
  MAE = mae(model3_no_outliers, data = data_cleaned_no_outliers))
##          R2      RMSE       MAE
## 1 0.7280849 0.2590942 0.1976344
glance(model3_no_outliers)
## # A tibble: 1 × 12
##   r.squared adj.r.squared sigma statistic p.value    df logLik   AIC   BIC
##       <dbl>         <dbl> <dbl>     <dbl>   <dbl> <dbl>  <dbl> <dbl> <dbl>
## 1     0.728         0.728 0.259     2539.       0    51 -3311. 6727. 7193.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
ggplot(data_cleaned, aes(x = Distance, y = Log_Price))+
  geom_point()+
  geom_smooth(method = "lm", se = FALSE, color = "blue")+
  geom_smooth(method = "lm", se = FALSE, data = data_cleaned_no_outliers, color = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'



Influence of High-Leverage Points

To check the robustness of the model, an additional analysis was conducted in which influential observations, identified by their Cook’s distance, were to be removed and the regression refitted. Such points can disproportionately influence the model’s estimates, potentially skewing the results.

The refitted model reports exactly the same residual degrees of freedom (48,367), R-squared (0.7281), RMSE, and MAE as Model 3. Identical degrees of freedom mean that the filter, as written, removed no observations, so the two fits are the same model and the two regression lines in the plot coincide.

This comparison therefore does not yet demonstrate robustness to influential points. Establishing that would require a refit on data from which the flagged observations have actually been excluded.
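
A minimal sketch of one way to exclude influential observations, assuming a cutoff of 4/n on Cook's distance computed over all rows. It uses simulated data and base R's cooks.distance() in place of broom::augment(); the data frame and variable names are hypothetical and not part of the housing analysis.

```r
set.seed(1)
# Hypothetical data: a linear trend with one deliberately corrupted row.
dat <- data.frame(x = 1:30)
dat$y <- 2 * dat$x + rnorm(30)
dat$y[30] <- 200

fit_all <- lm(y ~ x, data = dat)

# Cook's distance for every observation, compared against 4 / n.
cooks_d <- cooks.distance(fit_all)
influential <- cooks_d > 4 / nrow(dat)

# Drop flagged rows with the logical vector itself, not by row names.
dat_trimmed <- dat[!influential, ]
fit_trimmed <- lm(y ~ x, data = dat_trimmed)

# Fewer observations in the refit confirm that rows were removed.
print(c(n_all = nobs(fit_all), n_trimmed = nobs(fit_trimmed)))
print(rbind(all = coef(fit_all), trimmed = coef(fit_trimmed)))
```

Comparing the number of observations before and after the filter is a quick check that the trimmed model was fitted on different data; the coefficient table then shows how much the estimates move.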

Conclusion

In this comprehensive analysis of the Melbourne housing market, we explored the influence of both intrinsic property characteristics and external factors on property prices. The analysis revealed several key insights: prices rise with the number of rooms, fall with distance from the city center, are lower for units and townhouses than for houses, and vary substantially across regions and council areas.

The regression analysis quantified these relationships, with the final model achieving an R² of 0.728, meaning that it explains about 73% of the variance in log-transformed prices. Further analysis could involve exploring non-linear models or incorporating additional features to improve prediction accuracy.